30 research outputs found
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
We propose a parallel-data-free voice-conversion (VC) method that can learn a
mapping from source to target speech without relying on parallel data. The
proposed method is general purpose, high quality, and parallel-data-free, and
works without any extra data, modules, or alignment procedures. It also avoids
over-smoothing, which occurs in many conventional statistical model-based VC
methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial
network (CycleGAN) with gated convolutional neural networks (CNNs) and an
identity-mapping loss. A CycleGAN learns forward and inverse mappings
simultaneously using adversarial and cycle-consistency losses. This makes it
possible to find an optimal pseudo pair from unpaired data. Furthermore, the
adversarial loss contributes to reducing over-smoothing of the converted
feature sequence. We configure a CycleGAN with gated CNNs and train it with an
identity-mapping loss. This allows the mapping function to capture sequential
and hierarchical structures while preserving linguistic information. We
evaluated our method on a parallel-data-free VC task. An objective evaluation
showed that the converted feature sequence was near natural in terms of global
variance and modulation spectra. A subjective evaluation showed that the
quality of the converted speech was comparable to that obtained with a Gaussian
mixture model-based method trained under advantageous conditions, i.e., with
parallel data and twice the amount of data.
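To make the training objective concrete, the following is a minimal
PyTorch-style sketch of the three losses the abstract describes (adversarial,
cycle-consistency, and identity-mapping). The mapping names (G_XY, G_YX, D_X,
D_Y), the least-squares adversarial form, and the loss weights are
illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a CycleGAN-VC-style objective (PyTorch-style, illustrative).
import torch
import torch.nn.functional as F

def lsgan(pred, real):
    # Least-squares adversarial loss: push predictions toward 1 (real) or 0 (fake).
    target = torch.ones_like(pred) if real else torch.zeros_like(pred)
    return F.mse_loss(pred, target)

def cyclegan_vc_losses(x, y, G_XY, G_YX, D_X, D_Y,
                       lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G_XY(x), G_YX(y)

    # Discriminators: distinguish real feature sequences from converted ones.
    d_loss = (lsgan(D_Y(y), True) + lsgan(D_Y(fake_y.detach()), False) +
              lsgan(D_X(x), True) + lsgan(D_X(fake_x.detach()), False))

    # Adversarial terms for the generators (these also reduce over-smoothing).
    adv = lsgan(D_Y(fake_y), True) + lsgan(D_X(fake_x), True)

    # Cycle consistency: x -> y' -> x (and y -> x' -> y) should round-trip,
    # which lets the model find pseudo pairs from unpaired data.
    cyc = F.l1_loss(G_YX(fake_y), x) + F.l1_loss(G_XY(fake_x), y)

    # Identity mapping: a target-domain input should pass through unchanged,
    # encouraging preservation of linguistic content.
    idt = F.l1_loss(G_XY(y), y) + F.l1_loss(G_YX(x), x)

    g_loss = adv + lambda_cyc * cyc + lambda_id * idt
    return g_loss, d_loss
```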
Label-Noise Robust Multi-Domain Image-to-Image Translation
Multi-domain image-to-image translation is a problem where the goal is to
learn mappings among multiple domains. This problem is challenging in terms of
scalability because it requires learning numerous mappings, the number of
which increases in proportion to the number of domains. However, generative
adversarial networks (GANs) have emerged recently as a powerful framework for
this problem. In particular, label-conditional extensions (e.g., StarGAN) have
become a promising solution owing to their ability to address this problem
using only a single unified model. Nonetheless, a limitation is that they rely
on the availability of large-scale clean-labeled data, which are often
laborious or impractical to collect in a real-world scenario. To overcome this
limitation, we propose a novel model called the label-noise robust
image-to-image translation model (RMIT), which can learn a clean-label
conditional generator even when only noisy labeled data are available. In
particular, we propose a novel loss called the virtual cycle consistency loss,
which regularizes the cyclic reconstruction independently of the noisy labeled
data, and we introduce advanced techniques to boost performance in
practice. Our experimental results demonstrate that RMIT is useful for
obtaining label-noise robustness in various settings including synthetic and
real-world noise.
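For context, the sketch below shows the standard label-conditional
cycle-reconstruction term (as in StarGAN-style models), which depends on the
dataset's source label; this is exactly the dependence that becomes unreliable
under label noise and that the virtual cycle consistency loss is designed to
remove. The names (G, c_src, c_trg) are illustrative, and the paper's actual
formulation is not reproduced here.

```python
# Standard StarGAN-style cycle reconstruction, shown for context (PyTorch-style).
# Note that it consumes the dataset label c_src, which is what becomes
# unreliable under label noise.
import torch.nn.functional as F

def cycle_reconstruction_loss(G, x, c_src, c_trg):
    x_fake = G(x, c_trg)       # translate to the target domain
    x_rec = G(x_fake, c_src)   # translate back using the (possibly noisy) source label
    return F.l1_loss(x_rec, x)
```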
Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks
Understanding the 3D world from 2D projected natural images is a fundamental
challenge in computer vision and graphics. Recently, an unsupervised learning
approach has garnered considerable attention owing to its advantages in data
collection. However, to mitigate training limitations, typical methods must
impose assumptions on the viewpoint distribution (e.g., a dataset containing
various viewpoint images) or object shape (e.g., symmetric objects). These
assumptions often restrict applications; for instance, the application to
non-rigid objects or images captured from similar viewpoints (e.g., flower or
bird images) remains a challenge. To complement these approaches, we propose
aperture rendering generative adversarial networks (AR-GANs), which equip
aperture rendering on top of GANs, and adopt focus cues to learn the depth and
depth-of-field (DoF) effect of unlabeled natural images. To address the
ambiguities triggered by the unsupervised setting (i.e., ambiguities between
smooth texture and out-of-focus blurs, and between foreground and background
blurs), we develop DoF mixture learning, which enables the generator to learn
the real image distribution while generating diverse DoF images. In addition,
we devise a center focus prior to guide the learning direction. In the
experiments, we demonstrate the effectiveness of AR-GANs on various datasets,
such as flower,
bird, and face images, demonstrate their portability by incorporating them into
other 3D representation learning GANs, and validate their applicability in
shallow DoF rendering.
Comment: Accepted to CVPR 2021 (Oral). Project page:
https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/ar-gan
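The following is a highly simplified sketch of what depth-dependent aperture
rendering can look like, under the common layered-compositing approximation:
depth is discretized into layers, each layer is blurred with a Gaussian whose
sigma grows with its distance from the focal plane, and the results are
composited. All function names and parameters are illustrative assumptions;
this is not the paper's renderer.

```python
# Simplified layered depth-of-field rendering (PyTorch-style, illustrative).
import torch
import torch.nn.functional as F

def gaussian_blur(img, sigma):
    # Separable Gaussian blur; in-focus layers pass through unchanged.
    if sigma < 1e-3:
        return img
    k = int(2 * round(3 * sigma) + 1)
    xs = torch.arange(k, dtype=img.dtype, device=img.device) - k // 2
    g = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    c = img.shape[1]
    kx = g.view(1, 1, 1, k).repeat(c, 1, 1, 1)
    ky = g.view(1, 1, k, 1).repeat(c, 1, 1, 1).view(c, 1, k, 1)
    img = F.conv2d(img, kx, padding=(0, k // 2), groups=c)
    return F.conv2d(img, ky, padding=(k // 2, 0), groups=c)

def render_dof(image, depth, focal_depth=0.5, aperture=4.0, n_layers=8):
    # image: (B, 3, H, W); depth in [0, 1]: (B, 1, H, W).
    depth = depth.clamp(0, 1 - 1e-6)
    edges = torch.linspace(0, 1, n_layers + 1)
    out = torch.zeros_like(image)
    weight = torch.zeros_like(depth)
    for i in range(n_layers):
        lo, hi = float(edges[i]), float(edges[i + 1])
        mask = ((depth >= lo) & (depth < hi)).float()
        sigma = aperture * abs(0.5 * (lo + hi) - focal_depth)  # circle of confusion
        out = out + gaussian_blur(image * mask, sigma)
        weight = weight + gaussian_blur(mask, sigma)
    return out / weight.clamp(min=1e-6)
```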
Label-Noise Robust Generative Adversarial Networks
Generative adversarial networks (GANs) are a framework that learns a
generative distribution through adversarial training. Recently, their
class-conditional extensions (e.g., conditional GAN (cGAN) and auxiliary
classifier GAN (AC-GAN)) have attracted much attention owing to their ability
to learn disentangled representations and to improve training stability.
However, their training requires the availability of large-scale
accurate class-labeled data, which are often laborious or impractical to
collect in a real-world scenario. To remedy this, we propose a novel family of
GANs called label-noise robust GANs (rGANs), which, by incorporating a noise
transition model, can learn a clean label conditional generative distribution
even when training labels are noisy. In particular, we propose two variants:
rAC-GAN, which is a bridging model between AC-GAN and the label-noise robust
classification model, and rcGAN, which is an extension of cGAN and solves this
problem with no reliance on any classifier. In addition to providing the
theoretical background, we demonstrate the effectiveness of our models through
extensive experiments using diverse GAN configurations, various noise settings,
and multiple evaluation metrics (in which we tested 402 conditions in total).
Our code is available at https://github.com/takuhirok/rGAN/.
Comment: Accepted to CVPR 2019 (Oral). Project page:
https://takuhirok.github.io/rGAN
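As a concrete illustration of the noise-transition idea, the sketch below
shows the forward-correction step standard in label-noise robust
classification, which rAC-GAN bridges to: the classifier's posterior over
clean classes is pushed through a transition matrix T before being scored
against the noisy dataset labels. The names and the availability of T are
assumptions, not the paper's exact formulation.

```python
# Sketch of the noise-transition step (PyTorch-style, illustrative):
# T[i, j] = p(noisy = j | clean = i). Generated samples, whose conditioning
# labels are clean by construction, would bypass T.
import torch
import torch.nn.functional as F

def noisy_label_loss(clean_logits, noisy_labels, T):
    p_clean = torch.softmax(clean_logits, dim=1)  # (B, K) clean-class posterior
    p_noisy = p_clean @ T                         # (B, K) corrupted posterior
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)
```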
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
We propose a learning-based filter that allows us to directly modify a
synthetic speech waveform into a natural speech waveform. Speech-processing
systems using a vocoder framework such as statistical parametric speech
synthesis and voice conversion are convenient, especially when the amount of
data is limited, because they make it possible to represent and process
interpretable acoustic features in a compact space, such as the fundamental
frequency (F0) and
mel-cepstrum. However, a well-known problem that leads to the quality
degradation of generated speech is an over-smoothing effect that eliminates
some detailed structure of generated/converted acoustic features. To address
this issue, we propose a synthetic-to-natural speech waveform conversion
technique that uses cycle-consistent adversarial networks and which does not
require any explicit assumption about the speech waveform in adversarial
learning. Unlike current techniques, our modification is performed at the
waveform level, so we expect that the proposed method will also make it possible
to generate `vocoder-less' sounding speech even if the input speech is
synthesized using a vocoder framework. The experimental results demonstrate
that our proposed method can 1) alleviate the over-smoothing effect of the
acoustic features despite the direct modification method used for the waveform
and 2) greatly improve the naturalness of the generated speech sounds.
Comment: SLT 2018.
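A minimal sketch of what such a non-autoregressive waveform-level filter could
look like, assuming a stack of dilated gated 1-D convolutions with residual
connections; this architecture is an illustrative assumption, not the paper's
exact network. It would be trained with a cycle-consistent adversarial
objective like the one sketched earlier.

```python
# Illustrative non-autoregressive waveform filter (PyTorch-style).
import torch
import torch.nn as nn

class WaveFilter(nn.Module):
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            nn.Conv1d(channels, 2 * channels, kernel_size=15,
                      dilation=2 ** i, padding=7 * 2 ** i)
            for i in range(n_blocks))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, wav):                 # wav: (B, 1, T) synthetic waveform
        h = self.inp(wav)
        for conv in self.blocks:
            a, b = conv(h).chunk(2, dim=1)  # gated linear unit
            h = h + a * torch.sigmoid(b)    # residual connection
        return self.out(h) + wav            # predict a residual correction
```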
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping
from source to target speech without relying on parallel data. This is an
important task, but it has been challenging due to the disadvantageous
training conditions. Recently, CycleGAN-VC has provided a breakthrough and
performed comparably to a parallel VC method without relying on any extra data,
modules, or time alignment procedures. However, there is still a large gap
between the real target and converted speech, and bridging this gap remains a
challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved
version of CycleGAN-VC incorporating three new techniques: an improved
objective (two-step adversarial losses), improved generator (2-1-2D CNN), and
improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC
task and analyzed the effect of each technique in detail. An objective
evaluation showed that these techniques help bring the converted feature
sequence closer to the target in terms of both global and local structures,
which we assess by using Mel-cepstral distortion and modulation spectra
distance, respectively. A subjective evaluation showed that CycleGAN-VC2
outperforms CycleGAN-VC in terms of naturalness and similarity for every
speaker pair, including intra-gender and inter-gender pairs.
Comment: Accepted to ICASSP 2019. Project page:
http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.htm
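Of the three techniques, the two-step adversarial losses are easy to state in
code: besides the usual adversarial loss on the one-step conversion, a second
discriminator scores the cyclically reconstructed features, counteracting the
over-smoothing induced by the pointwise cycle-consistency loss. The sketch
below is PyTorch-style and illustrative; the discriminator names and
least-squares form are assumptions.

```python
# Sketch of the two-step adversarial generator losses (illustrative).
import torch
import torch.nn.functional as F

def two_step_adv_loss_g(x, G_XY, G_YX, D_Y, D_X2):
    fake_y = G_XY(x)        # first step: conversion X -> Y
    cyc_x = G_YX(fake_y)    # second step: cyclic reconstruction X -> Y -> X
    # Usual adversarial loss on the converted features.
    adv1 = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y)))
    # Second adversarial loss on the reconstruction, mitigating the
    # over-smoothing caused by the pointwise (L1) cycle-consistency loss.
    adv2 = F.mse_loss(D_X2(cyc_x), torch.ones_like(D_X2(cyc_x)))
    return adv1 + adv2
```

With PatchGAN discriminators, D_Y and D_X2 output a map of per-patch
decisions, and the mean-squared-error terms above apply unchanged over that
map.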
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms
This paper describes a method based on sequence-to-sequence (Seq2Seq)
learning with attention and a context preservation mechanism for voice
conversion (VC) tasks. Seq2Seq learning has excelled at numerous tasks
involving sequence modeling, such as speech synthesis and recognition, machine
translation, and image captioning. In contrast to current VC techniques, our
method 1) stabilizes and accelerates the training procedure by considering
guided attention and the proposed context preservation losses, 2) allows not only
spectral envelopes but also fundamental frequency contours and durations of
speech to be converted, 3) requires no context information such as phoneme
labels, and 4) requires no time-aligned source and target speech data in
advance. In our experiments, the proposed VC framework can be trained in only
one day using a single NVIDIA Tesla K80 GPU, while the quality of the
synthesized speech is higher than that of speech converted by Gaussian mixture
model-based VC and is comparable to that of speech generated by recurrent
neural network-based text-to-speech synthesis, which can be regarded as an
upper limit on VC performance.
Comment: Submitted to ICASSP 2019.
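The guided attention idea (following Tachibana et al.'s TTS work) can be
sketched as follows: since source and target utterances proceed at roughly
similar rates in VC, attention mass far from the diagonal is penalized. The
shapes and the sharpness parameter g below are illustrative assumptions.

```python
# Sketch of a guided attention loss (PyTorch-style, illustrative).
import torch

def guided_attention_loss(attn, g=0.2):
    # attn: (B, N, T) attention weights over N output and T input steps.
    B, N, T = attn.shape
    n = torch.arange(N, device=attn.device).float() / max(N - 1, 1)
    t = torch.arange(T, device=attn.device).float() / max(T - 1, 1)
    # W is near 0 on the diagonal and approaches 1 far from it.
    W = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2 * g ** 2))
    return (attn * W[None]).mean()  # penalize off-diagonal attention mass
```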
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and
synthesized speech waveforms in statistical parametric speech synthesis and
provides fast inference with a moving-average model rather than an
autoregressive one, as well as high-quality speech synthesis through
adversarial training. However, the human ear can still distinguish the
processed speech waveforms from natural ones. One possible cause of this
distinguishability is the aliasing introduced into the processed speech
waveform by the down/up-sampling modules. To eliminate this aliasing and
provide higher-quality speech synthesis, we
propose WaveCycleGAN2, which 1) uses generators without down/up-sampling
modules and 2) combines discriminators of the waveform domain and acoustic
parameter domain. The results show that the proposed method 1) alleviates the
aliasing well, 2) is useful for both speech waveforms generated by
analysis-and-synthesis and statistical parametric speech synthesis, and 3)
achieves a mean opinion score comparable to those of natural speech and speech
synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech
samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
Comment: Submitted to INTERSPEECH 2019.
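A PyTorch-style sketch of how waveform-domain and acoustic-parameter-domain
discriminators can be combined in the generator objective; here a log-STFT
magnitude stands in for the acoustic parameters, and the discriminator names
and least-squares form are assumptions rather than the paper's exact setup.

```python
# Sketch of a dual-domain adversarial generator loss (illustrative).
import torch
import torch.nn.functional as F

def dual_domain_g_loss(fake_wav, D_wave, D_spec, n_fft=1024, hop=256):
    # fake_wav: (B, 1, T) generated waveform.
    window = torch.hann_window(n_fft, device=fake_wav.device)
    spec = torch.stft(fake_wav.squeeze(1), n_fft, hop_length=hop,
                      window=window, return_complex=True)
    log_mag = spec.abs().clamp(min=1e-5).log()  # acoustic-parameter stand-in
    adv_w = F.mse_loss(D_wave(fake_wav), torch.ones_like(D_wave(fake_wav)))
    adv_s = F.mse_loss(D_spec(log_mag), torch.ones_like(D_spec(log_mag)))
    return adv_w + adv_s
```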
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice
conversion (VC) by using a variant of a generative adversarial network (GAN)
called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it
(1) requires no parallel utterances, transcriptions, or time alignment
procedures for speech generator training, (2) simultaneously learns
many-to-many mappings across different attribute domains using a single
generator network, (3) is able to generate converted speech signals quickly
enough to allow real-time implementations, and (4) requires only several minutes
of training examples to generate reasonably realistic-sounding speech.
Subjective evaluation experiments on a non-parallel many-to-many speaker
identity conversion task revealed that the proposed method obtained higher
sound quality and speaker similarity than a state-of-the-art method based on
variational autoencoding GANs.
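A PyTorch-style sketch of a StarGAN-style generator objective matching the
abstract's single conditional generator: an adversarial term, a domain
(speaker) classification term, a cycle term, and an identity term. Speaker
codes are assumed to be integer indices that G and the classifier C handle
internally, and the loss weights are illustrative assumptions.

```python
# Sketch of StarGAN-style generator losses for VC (illustrative).
import torch
import torch.nn.functional as F

def stargan_vc_g_loss(x, c_src, c_trg, G, D, C,
                      lambda_cls=1.0, lambda_cyc=10.0, lambda_id=5.0):
    fake = G(x, c_trg)                     # convert x toward target speaker c_trg
    adv = F.mse_loss(D(fake), torch.ones_like(D(fake)))  # fool the discriminator
    cls = F.cross_entropy(C(fake), c_trg)  # be classified as the target speaker
    cyc = F.l1_loss(G(fake, c_src), x)     # round-trip back to the source
    idt = F.l1_loss(G(x, c_src), x)        # leave same-speaker input unchanged
    return adv + lambda_cls * cls + lambda_cyc * cyc + lambda_id * idt
```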
ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion (VC) method using sequence-to-sequence
(seq2seq or S2S) learning, which flexibly converts not only the voice
characteristics but also the pitch contour and duration of input speech. The
proposed method, called ConvS2S-VC, has three key features. First, it uses a
model with a fully convolutional architecture. This is particularly
advantageous in that it is suitable for parallel computations using GPUs. It is
also beneficial since it enables effective normalization techniques such as
batch normalization to be used for all the hidden layers in the networks.
Second, it achieves many-to-many conversion by simultaneously learning mappings
among multiple speakers using only a single model instead of separately
learning mappings between each speaker pair using a different model. This
enables the model to fully utilize available training data collected from
multiple speakers by capturing common latent features that can be shared across
different speakers. Owing to this structure, our model works reasonably well
even without source speaker information, thus making it able to handle
any-to-many conversion tasks. Third, we introduce a mechanism called
conditional batch normalization, which switches the batch normalization layers
in accordance with the target speaker. This mechanism has been found to
be extremely effective for our many-to-many conversion model. We conducted
speaker identity conversion experiments and found that ConvS2S-VC obtained
higher sound quality and speaker similarity than baseline methods. We also
found from audio examples that it could perform well in various tasks including
emotional expression conversion, electrolaryngeal speech enhancement, and
English accent conversion.
Comment: Published in IEEE/ACM Trans. ASLP,
https://ieeexplore.ieee.org/document/911344
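The conditional batch normalization described above can be sketched
concisely: a shared BatchNorm without affine parameters, followed by a
per-speaker scale and bias selected by the target-speaker index. This
PyTorch-style module is a plausible minimal form, not the authors' exact
implementation.

```python
# Minimal sketch of conditional batch normalization (illustrative).
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, num_features, num_speakers):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Embedding(num_speakers, num_features)
        self.beta = nn.Embedding(num_speakers, num_features)
        nn.init.ones_(self.gamma.weight)   # start as an identity transform
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, speaker):              # x: (B, C, T), speaker: (B,)
        h = self.bn(x)
        g = self.gamma(speaker).unsqueeze(-1)   # per-speaker scale, (B, C, 1)
        b = self.beta(speaker).unsqueeze(-1)    # per-speaker bias, (B, C, 1)
        return g * h + b
```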